
feat: async recheck support #1062

Open
technicallyty wants to merge 7 commits into main from technicallyty/STACK-2402-krakatoa-recheck

Conversation

@technicallyty
Contributor

@technicallyty technicallyty commented Mar 10, 2026

Description

we recently updated comet to no longer lock on recheck, pushing concurrency responsibility to the application.

changes:

  • route CheckTx through the app-side mempool insert worker instead of BaseApp.runTx
  • fix Cosmos rechecked tx tracking to replace entries by signer/nonce identity rather than pointer identity. the previous code did not allow Cosmos fee-replacement txs
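
A minimal sketch of the second change, assuming a heavily simplified store with single-signer txs. The names `tx`, `txKey`, and `txStore` are hypothetical stand-ins for illustration and are not the PR's actual types. The point is only that an index keyed on signer/nonce lets a fee-replacement tx overwrite the earlier entry, which a pointer-keyed index cannot do because the replacement is a different object.

```go
package main

import "fmt"

// tx is a hypothetical stand-in for a Cosmos tx with a single signer.
type tx struct {
	Signer string
	Nonce  uint64
	Fee    uint64
}

// txKey builds the signer/nonce identity used to detect replacements.
func txKey(t *tx) string {
	return fmt.Sprintf("%s/%d", t.Signer, t.Nonce)
}

// txStore keeps txs in insertion order and indexes them by signer/nonce.
type txStore struct {
	txs  []*tx
	keys map[string]int // signer/nonce key -> position in txs
}

func newTxStore() *txStore {
	return &txStore{keys: make(map[string]int)}
}

// AddTx appends a new tx, or overwrites the entry that has the same
// signer/nonce identity (for example a fee bump).
func (s *txStore) AddTx(t *tx) {
	key := txKey(t)
	if i, ok := s.keys[key]; ok {
		s.txs[i] = t // same identity, different pointer
		return
	}
	s.keys[key] = len(s.txs)
	s.txs = append(s.txs, t)
}

func main() {
	s := newTxStore()
	s.AddTx(&tx{Signer: "alice", Nonce: 4, Fee: 10})
	s.AddTx(&tx{Signer: "alice", Nonce: 4, Fee: 20}) // fee replacement
	fmt.Println(len(s.txs), s.txs[0].Fee)
}
```

With a pointer-keyed map the second `AddTx` call would have added a second entry instead of replacing the first.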

Closes: STACK-2455


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • tackled an existing issue or discussed with a team member
  • left instructions on how to review the changes
  • targeted the main branch

@linear

linear bot commented Mar 10, 2026

@technicallyty technicallyty changed the title recheck support feat: async recheck support Mar 10, 2026
Comment on lines +17 to +20
// TODO: do we even do recheck anymore?
if request.Type == abci.CheckTxType_Recheck {
	return &abci.ResponseCheckTx{Code: abci.CodeTypeOK}, nil
}
Contributor Author

we can expect this to never be called with this mempool, yes? cc @mattac21

Contributor

yes

@aljo242
Contributor

aljo242 commented Mar 10, 2026

@greptile whats the scoop man

@technicallyty
Contributor Author

@greptile

@vladjdk
Member

vladjdk commented Mar 10, 2026

@greptile hello

@greptile-apps

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR introduces async recheck support by routing all CheckTx calls through the app-side insert worker goroutine instead of BaseApp.runTx, and fixes Cosmos tx replacement in CosmosTxStore to match by signer/nonce identity rather than pointer identity. A new check-tx-timeout config field (default 30 s) bounds how long the handler blocks waiting for an insert result.

Key changes:

  • NewCheckTxHandler now decodes the raw tx bytes and calls mempool.Insert with a bounded context, returning the insert error directly to CometBFT
  • CosmosTxStore.AddTx grows a secondary keys map[string]int keyed on the canonical signer/nonce tuple so that fee-replacement txs correctly overwrite the old entry
  • GetCheckTxTimeout / EVMMempoolCheckTxTimeout flag / MempoolConfig.CheckTxTimeout wire the new configurable timeout end-to-end

Issue found:

  • The handler does not check request.Type before processing, so CheckTxType_Recheck requests are treated identically to new transactions. Since the tx is already in the pool, mempool.Insert returns ErrAlreadyKnown (non-OK), and CometBFT interprets a non-OK recheck response as the transaction becoming invalid — potentially evicting valid transactions silently. The included TestRecheckIsNoOp test will also fail for this reason (it sends malformed bytes expecting CodeTypeOK, but the decoder returns an error → non-OK response).

Confidence Score: 2/5

  • Not safe to merge — the missing recheck type guard will cause TestRecheckIsNoOp to fail in CI and can silently evict valid transactions from CometBFT's mempool during normal block processing.
  • The core logic change (routing CheckTx through the insert worker) is sound and the CosmosTxStore replacement fix is correct. However, the handler unconditionally processes CheckTxType_Recheck through the insert path — meaning every recheck for an already-known transaction will return ErrAlreadyKnown (non-OK), causing CometBFT to evict those transactions. This is a runtime correctness bug that would manifest during normal validator operation, not just in tests.
  • mempool/check_tx.go requires a CheckTxType_Recheck early-return guard before the decode and insert logic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| mempool/check_tx.go | New async handler routes all CheckTx requests, including Recheck, through the insert worker. Missing CheckTxType_Recheck guard will cause TestRecheckIsNoOp to fail and can silently evict valid transactions during recheck. |
| mempool/tx_store.go | Adds signer/nonce-keyed replacement logic to CosmosTxStore via a new keys map. Logic looks correct for single- and multi-signer txs; the store is rebuilt per block so bounded growth is not a concern. |
| server/server_app_options.go | GetCheckTxTimeout allows 0 to pass through (guards only against < 0), while NewCheckTxHandler panics on timeout <= 0. A user setting check-tx-timeout = "0s" passes GetCheckTxTimeout but panics the node at startup. (Note: previously flagged thread item.) |
| mempool/check_tx_test.go | New test suite covering EVM and Cosmos CheckTx paths. TestRecheckIsNoOp expects CodeTypeOK for a recheck with malformed bytes; this will fail without a Recheck type guard in the handler. |
| server/config/config.go | Adds CheckTxTimeout to MempoolConfig with a 30s default and Validate() enforcement (rejects <= 0). Also removes HistoricalGRPCAddressBlockRange propagation from SDK config, which is unrelated to this PR but a silent behaviour change. |
| evmd/mempool.go | Passes Trace() and GetCheckTxTimeout to NewCheckTxHandler. Wiring is correct; panic risk flows from the GetCheckTxTimeout/handler mismatch noted elsewhere. |

Sequence Diagram

sequenceDiagram
    participant C as CometBFT
    participant H as CheckTxHandler
    participant D as TxDecoder
    participant W as InsertWorker
    participant MP as ExperimentalEVMMempool

    C->>H: RequestCheckTx (New or Recheck)
    Note over H: ⚠️ Type not checked — Recheck<br/>takes same path as New
    H->>D: TxDecoder(request.Tx)
    alt decode error
        D-->>H: error
        H-->>C: ResponseCheckTx (non-OK)
    else decode success
        D-->>H: sdk.Tx
        H->>H: context.WithTimeout(Background, timeout)
        H->>MP: Insert(ctx, tx)
        MP->>W: send insertRequest{tx, errC}
        W->>MP: process tx (validate, add to pool)
        W-->>MP: errC <- result
        MP-->>H: err (nil or ErrAlreadyKnown etc.)
        alt insert OK
            H-->>C: ResponseCheckTx{Code: OK}
        else insert error (e.g. ErrAlreadyKnown on recheck)
            H-->>C: ResponseCheckTx (non-OK)
            Note over C: Interprets non-OK recheck as<br/>invalid tx → evicts from mempool
        end
    end

Last reviewed commit: 4fe1466

Contributor

@mattac21 mattac21 left a comment

could you also rebase this onto main? the tests are pretty broken on feat/krakatoa, but on main you should be able to get all of the system, unit, and integration tests passing


ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
errC, err := mempool.insert(ctx, tx)
Contributor

the mempool.Insert function already waits on the errC or ctx to be done like you are doing here. could we use that instead of using the private mempool.insert?

Contributor Author

Comment on lines 18 to +19
index map[sdk.Tx]int
keys map[string]int
Contributor

keys is now serving the same purpose as index right? we can remove index now I think

s.mu.Lock()
defer s.mu.Unlock()

if key, ok := cosmosTxKey(tx); ok {
Contributor

hm, thinking through this, i'm not actually sure this replacement is safe to do without rechecking all txs for this account. when we replace a tx we may want to just remove all txs > replaced nonce from the tx store for this account and then let them be included in the next block.

I think this isn't safe because we would need to recheck all txs after the replaced one on top of the state of this new tx, which we are not doing. for example, if we recheck txs 4, 5, and 6 of an account and include them in the tx store, then someone can replace tx 4 with a completely different tx that may invalidate 5 and 6, but we are not rechecking those against the new tx 4's context, which may then cause the proposal to be invalid.

I think this is an issue for evm txs in the legacypool too that we need to address.
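
The pruning idea suggested above (on replacement, drop the account's txs with a higher nonce) could look roughly like the following. This is a sketch under simplifying assumptions: single-signer txs, at most one tx per signer/nonce, and a plain slice as the store. `tx` and `replaceAndPrune` are hypothetical names, and the follow-up PR may take a different approach.

```go
package main

import "fmt"

// tx is a hypothetical stand-in for a rechecked tx with a single signer.
type tx struct {
	Signer string
	Nonce  uint64
	Fee    uint64
}

// replaceAndPrune swaps repl in for the tx with the same signer and nonce and
// drops that signer's higher-nonce txs, since they were rechecked against the
// old tx's state. If nothing matches, txs is returned unchanged.
func replaceAndPrune(txs []tx, repl tx) []tx {
	found := false
	for _, t := range txs {
		if t.Signer == repl.Signer && t.Nonce == repl.Nonce {
			found = true
			break
		}
	}
	if !found {
		return txs
	}
	out := make([]tx, 0, len(txs))
	for _, t := range txs {
		switch {
		case t.Signer != repl.Signer, t.Nonce < repl.Nonce:
			out = append(out, t) // unaffected by the replacement
		case t.Nonce == repl.Nonce:
			out = append(out, repl) // the replaced slot
		}
		// same signer with a higher nonce: dropped until rechecked again
	}
	return out
}

func main() {
	txs := []tx{
		{Signer: "alice", Nonce: 4, Fee: 10},
		{Signer: "alice", Nonce: 5, Fee: 10},
		{Signer: "alice", Nonce: 6, Fee: 10},
		{Signer: "bob", Nonce: 1, Fee: 10},
	}
	for _, t := range replaceAndPrune(txs, tx{Signer: "alice", Nonce: 4, Fee: 20}) {
		fmt.Printf("%s nonce=%d fee=%d\n", t.Signer, t.Nonce, t.Fee)
	}
}
```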

Contributor Author

working on a separate PR to address this issue in both evm and cosmos tx stores

select {
case err := <-errC:
	if err != nil {
		return sdkerrors.ResponseCheckTxWithEvents(err, gInfo.GasWanted, gInfo.GasUsed, anteEvents, false), nil
Contributor

were the anteEvents also always nil here? we are missing out on events now as well with this? should we modify the response of mempool.Insert to return some of this info?

Contributor Author

@technicallyty technicallyty Mar 11, 2026

i checked on v0.53.x and main, calling broadcast tx sync never returned anything other than code, tx hash, and a log if it failed.

you can see comet stripping down the response here: https://github.com/cometbft/cometbft/blob/1bb8b386fc366bc4655dede0535e16d1ad669c7d/rpc/core/mempool.go#L58-L64

Contributor

got it, tested on gaia as well earlier and this was the case too

@technicallyty technicallyty changed the base branch from feat/krakatoa to main March 11, 2026 17:17
@technicallyty technicallyty changed the base branch from main to feat/krakatoa March 11, 2026 17:17
@technicallyty technicallyty force-pushed the technicallyty/STACK-2402-krakatoa-recheck branch from f7e5c47 to c54ef7b Compare March 11, 2026 17:55
@technicallyty technicallyty changed the base branch from feat/krakatoa to main March 11, 2026 17:55
@aljo242
Contributor

aljo242 commented Mar 12, 2026

@greptile re-review

@technicallyty
Contributor Author

@greptile re-review

Comment on lines +20 to 32
return func(_ types.RunTx, request *abci.RequestCheckTx) (*abci.ResponseCheckTx, error) {
	tx, err := mempool.txConfig.TxDecoder()(request.Tx)
	if err != nil {
		// detect if there is a nonce gap error (only returned for EVM transactions)
		if errors.Is(err, ErrNonceGap) || errors.Is(err, ErrNonceLow) {
			// send it to the mempool for further triage
			err := mempool.InsertInvalidNonce(request.Tx)
			if err != nil {
				return sdkerrors.ResponseCheckTxWithEvents(err, gInfo.GasWanted, gInfo.GasUsed, anteEvents, false), nil
			}
		}
		// If it's already known, this can mean the tx was promoted from nonce gap to valid
		// and by allowing ErrAlreadyKnown to be silent, we allow re-gossiping of such txs
		// this also covers the case of re-submission of the same tx enforcing overpricing for replacement
		if errors.Is(err, txpool.ErrAlreadyKnown) {
			return sdkerrors.ResponseCheckTxWithEvents(nil, gInfo.GasWanted, gInfo.GasUsed, anteEvents, false), nil
		}

		// anything else, return regular error
		return sdkerrors.ResponseCheckTxWithEvents(err, gInfo.GasWanted, gInfo.GasUsed, anteEvents, false), nil
		return sdkerrors.ResponseCheckTxWithEvents(err, 0, 0, nil, debug), nil
	}

	return &abci.ResponseCheckTx{
		GasWanted: int64(gInfo.GasWanted), // #nosec G115 -- this is copied from the Cosmos SDK
		GasUsed: int64(gInfo.GasUsed), // #nosec G115 -- this is copied from the Cosmos SDK
		Log: result.Log,
		Data: result.Data,
		Events: types.MarkEventsToIndex(result.Events, nil),
	}, nil
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := mempool.Insert(ctx, tx); err != nil {
		return sdkerrors.ResponseCheckTxWithEvents(err, 0, 0, nil, debug), nil
	}
	return &abci.ResponseCheckTx{Code: abci.CodeTypeOK}, nil
}

Recheck requests not short-circuited — breaks TestRecheckIsNoOp and risks tx eviction

The handler does not inspect request.Type, so abci.CheckTxType_Recheck requests take the same code path as new transactions.

Two concrete failure modes follow:

  1. TestRecheckIsNoOp will fail. The test submits []byte("not-a-real-tx") with CheckTxType_Recheck and asserts CodeTypeOK. With the current code the decoder will return an error and sdkerrors.ResponseCheckTxWithEvents will produce a non-CodeTypeOK response.

  2. Valid in-mempool transactions get evicted during recheck. When CometBFT fires a recheck for a tx that is already in the pool, mempool.Insert will return ErrAlreadyKnown (non-OK). CometBFT interprets a non-OK recheck response as the transaction having become invalid and removes it from its tracking — silently draining the mempool.

Since the app sets OperateExclusively = true, CometBFT delegates full mempool management to the application layer. Recheck is therefore a no-op from CometBFT's perspective and should always return CodeTypeOK without hitting the insert worker:

return func(_ types.RunTx, request *abci.RequestCheckTx) (*abci.ResponseCheckTx, error) {
    if request.Type == abci.CheckTxType_Recheck {
        return &abci.ResponseCheckTx{Code: abci.CodeTypeOK}, nil
    }
    tx, err := mempool.txConfig.TxDecoder()(request.Tx)
    ...
}
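
To make the effect of the guard concrete, here is a small self-contained sketch that uses local stand-in types instead of the real ABCI ones. `checkTxType`, `request`, `handler`, and `decode` are all hypothetical, and the numeric codes are placeholders. It only demonstrates that a recheck returns OK without reaching the decoder or the insert path, while a new tx with undecodable bytes is still rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// checkTxType is a stand-in for the ABCI CheckTx type enum.
type checkTxType int

const (
	checkTxNew checkTxType = iota
	checkTxRecheck
)

// Placeholder response codes for this sketch.
const (
	codeOK  uint32 = 0
	codeErr uint32 = 1
)

// request is a stand-in for the CheckTx request.
type request struct {
	Tx   []byte
	Type checkTxType
}

// handler counts how many requests reached the insert path.
type handler struct {
	inserts int
}

// decode is a placeholder decoder that rejects empty input.
func decode(raw []byte) (string, error) {
	if len(raw) == 0 {
		return "", errors.New("empty tx")
	}
	return string(raw), nil
}

// CheckTx short-circuits rechecks before decoding or inserting.
func (h *handler) CheckTx(req request) uint32 {
	if req.Type == checkTxRecheck {
		return codeOK // no-op: the app-side mempool owns recheck
	}
	if _, err := decode(req.Tx); err != nil {
		return codeErr
	}
	h.inserts++ // stands in for the mempool insert
	return codeOK
}

func main() {
	h := &handler{}
	fmt.Println(h.CheckTx(request{Tx: nil, Type: checkTxRecheck}))
	fmt.Println(h.CheckTx(request{Tx: nil, Type: checkTxNew}))
	fmt.Println(h.CheckTx(request{Tx: []byte("tx"), Type: checkTxNew}))
	fmt.Println(h.inserts)
}
```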

@codecov

codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 69.23077% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.27%. Comparing base (bea1ab9) to head (9dd5c85).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mempool/check_tx.go | 53.84% | 1 Missing and 5 partials ⚠️ |
| server/server_app_options.go | 61.53% | 2 Missing and 3 partials ⚠️ |
| mempool/tx_store.go | 90.90% | 1 Missing and 1 partial ⚠️ |
| server/config/config.go | 33.33% | 2 Missing ⚠️ |
| server/start.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1062      +/-   ##
==========================================
+ Coverage   64.11%   65.27%   +1.16%     
==========================================
  Files         331      331              
  Lines       23303    23302       -1     
==========================================
+ Hits        14941    15211     +270     
+ Misses       6946     6920      -26     
+ Partials     1416     1171     -245     
| Files with missing lines | Coverage Δ |
| --- | --- |
| server/flags/flags.go | 0.00% <ø> (ø) |
| server/start.go | 0.00% <0.00%> (ø) |
| mempool/tx_store.go | 88.67% <90.90%> (-11.33%) ⬇️ |
| server/config/config.go | 41.78% <33.33%> (-5.12%) ⬇️ |
| server/server_app_options.go | 41.74% <61.53%> (-13.81%) ⬇️ |
| mempool/check_tx.go | 36.36% <53.84%> (-52.53%) ⬇️ |

... and 6 files with indirect coverage changes

